{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4396197f",
   "metadata": {},
   "source": [
    "# Using LLMs To Summarize Personal Research\n",
    "\n",
    "Our goal is to have LLM aid us in generating interview quetions for someone. I find that I'm constantly trying to ramp up to a person's background and story when preparing to meet them.\n",
    "\n",
    "There is a ton of awesome resources about a person online we can use\n",
    "\n",
    "* Twitter Profiles\n",
    "* Websites\n",
    "* Other Interviews (YouTube or Text)\n",
    "\n",
    "Let's bring all these together by first pulling the information and then generating questions or bullet points we can use as preparation.\n",
    "\n",
    "First let's import our packages! We'll be using LangChain to help us interact with OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3311cda3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# LLMs\n",
    "from langchain import PromptTemplate\n",
    "from langchain.llms import OpenAI\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.chains.summarize import load_summarize_chain\n",
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "# Twitter\n",
    "import tweepy\n",
    "\n",
    "# Scraping\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from markdownify import markdownify as md\n",
    "\n",
    "# YouTube\n",
    "from langchain.document_loaders import YoutubeLoader\n",
    "# !pip install youtube-transcript-api\n",
    "\n",
    "# Environment Variables\n",
    "import os\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1578a7f5",
   "metadata": {},
   "source": [
    "You'll need a few API keys to complete the script below. It's modular so if you don't want to pull from Twitter feel free to leave those blank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "394a7c4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "TWITTER_API_KEY = os.getenv('TWITTER_API_KEY', 'YourAPIKeyIfNotSet')\n",
    "TWITTER_API_SECRET = os.getenv('TWITTER_API_SECRET', 'YourAPIKeyIfNotSet')\n",
    "TWITTER_ACCESS_TOKEN = os.getenv('TWITTER_ACCESS_TOKEN', 'YourAPIKeyIfNotSet')\n",
    "TWITTER_ACCESS_TOKEN_SECRET = os.getenv('TWITTER_ACCESS_TOKEN_SECRET', 'YourAPIKeyIfNotSet')\n",
    "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'YourAPIKeyIfNotSet')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc50fe5e",
   "metadata": {},
   "source": [
    "For this tutorial, let's pretend we are going to be interviewing [Elad Gil](https://eladgil.com/) since he has a bunch of content online\n",
    "\n",
    "### Pulling Data From Twitter\n",
    "Great, now let's set up a function that will pull tweets for us. This will help us get current events that the user is talking about. I'm excluding replies since they usually don't have a ton of high signal text from the user. This is the same code that was used in the [Twitter AI Bot tutorial](https://youtu.be/yLWLDjT01q8)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "edd68521",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_original_tweets(screen_name, tweets_to_pull=80, tweets_to_return=80):\n",
    "    \n",
    "    # Tweepy set up\n",
    "    auth = tweepy.OAuthHandler(TWITTER_API_KEY, TWITTER_API_SECRET)\n",
    "    auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)\n",
    "    api = tweepy.API(auth)\n",
    "\n",
    "    # Holder for the tweets you'll find\n",
    "    tweets = []\n",
    "    \n",
    "    # Go and pull the tweets\n",
    "    tweepy_results = tweepy.Cursor(api.user_timeline,\n",
    "                                   screen_name=screen_name,\n",
    "                                   tweet_mode='extended',\n",
    "                                   exclude_replies=True).items(tweets_to_pull)\n",
    "    \n",
    "    # Run through tweets and remove retweets and quote tweets so we can only look at a user's raw emotions\n",
    "    for status in tweepy_results:\n",
    "        if hasattr(status, 'retweeted_status') or hasattr(status, 'quoted_status'):\n",
    "            # Skip if it's a retweet or quote tweet\n",
    "            continue\n",
    "        else:\n",
    "            tweets.append({'full_text': status.full_text, 'likes': status.favorite_count})\n",
    "\n",
    "    \n",
    "    # Sort the tweets by number of likes. This will help us short_list the top ones later\n",
    "    sorted_tweets = sorted(tweets, key=lambda x: x['likes'], reverse=True)\n",
    "\n",
    "    # Get the text and drop the like count from the dictionary\n",
    "    full_text = [x['full_text'] for x in sorted_tweets][:tweets_to_return]\n",
    "    \n",
    "    # Convert the list of tweets into a string of tweets we can use in the prompt later\n",
    "    users_tweets = \"\\n\\n\".join(full_text)\n",
    "            \n",
    "    return users_tweets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c968c57",
   "metadata": {},
   "source": [
    "Ok cool, let's try it out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c9871622",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "More AI companies with sudden virality + paying customers should just bootstrap\n",
      "\n",
      "0. Running co for cash may be best success\n",
      "\n",
      "1. If it does scale, being profitable or near to it creates lot of options\n",
      "\n",
      "2. it may not scale, or only work for a few months\n",
      "\n",
      "3. Why get on the… https://t.co/Q9TRQo4yau\n",
      "\n",
      "Som\n"
     ]
    }
   ],
   "source": [
    "user_tweets = get_original_tweets(\"eladgil\")\n",
    "print (user_tweets[:300])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bba8191a",
   "metadata": {},
   "source": [
    "Awesome, now we have a few tweets let's move onto pulling data from a web page or two.\n",
    "\n",
    "### Pulling Data From Websites\n",
    "\n",
    "Let's do two pages\n",
    "\n",
    "1. His personal website which has his background - https://eladgil.com/\n",
    "2. One of my favorite blog posts from him around AI defensibility & moats - https://blog.eladgil.com/p/defensibility-and-competition\n",
    "\n",
    "First let's create a function that will scrape a website for us.\n",
    "\n",
    "We'll do this by pulling the raw html, put it in a BeautifulSoup object, then convert that object to Markdown for better parsing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "67c5c796",
   "metadata": {},
   "outputs": [],
   "source": [
    "def pull_from_website(url):\n",
    "    \n",
    "    # Doing a try in case it doesn't work\n",
    "    try:\n",
    "        response = requests.get(url)\n",
    "    except:\n",
    "        # In case it doesn't work\n",
    "        print (\"Whoops, error\")\n",
    "        return\n",
    "    \n",
    "    # Put your response in a beautiful soup\n",
    "    soup = BeautifulSoup(response.text, 'html.parser')\n",
    "    \n",
    "    # Get your text\n",
    "    text = soup.get_text()\n",
    "\n",
    "    # Convert your html to markdown. This reduces tokens and noise\n",
    "    text = md(text)\n",
    "     \n",
    "    return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a24e408e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# I'm going to store my website data in a simple string.\n",
    "# There is likely optimization to make this better but it's a solid 80% solution\n",
    "\n",
    "website_data = \"\"\n",
    "urls = [\"https://eladgil.com/\", \"https://blog.eladgil.com/p/defensibility-and-competition\"]\n",
    "\n",
    "for url in urls:\n",
    "    text = pull_from_website(url)\n",
    "    \n",
    "    website_data += text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a05784b1",
   "metadata": {},
   "source": [
    "Awesome, now that we have both of those data sources, let's check out a sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "207a6b1a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n",
      "Elad Gil\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Welcome to Elad Gil's retro homepage!\n",
      "\n",
      " Who? I am a technology entrepreneur. LinkedIn profile is here.\n",
      "What?\n",
      "I am an investor or advisor to companies including Airbnb, Airtable, Anduril, Brex, Checkr, Coinbase, dbt Labs, Deel, Figma, Flexport, Gitlab, Gusto, Instacart, Navan, Notion, Opendoor, PagerDuty, Pinterest, Retool, Rippling, Samsara, Square, Stripe\n",
      "I am involved with AI com\n"
     ]
    }
   ],
   "source": [
    "print (website_data[:400])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b7ba6f4",
   "metadata": {},
   "source": [
    "Awesome, to round us off, let's get the information from a youtube video. YouTube has tons of data like Podcasts and interviews. This will be valuable for us to have.\n",
    "\n",
    "### Pulling Data From YouTube\n",
    "\n",
    "We'll use LangChains YouTube loaders for this. It only works if there is a transcript on the YT video already, if there isn't then we'll move on. You could get the transcript via Whisper if you really wanted to, but that's out of scope for today.\n",
    "\n",
    "We'll make a function we can use to loop through videos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "9ee1b47d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pulling data from YouTube in text form\n",
    "def get_video_transcripts(url):\n",
    "    loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)\n",
    "    documents = loader.load()\n",
    "    transcript = ' '.join([doc.page_content for doc in documents])\n",
    "    return transcript"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "b9bd52eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using a regular string to store the youtube transcript data\n",
    "# Video selection will be important.\n",
    "# Parsing interviews is a whole other can of worms so I'm opting for one where Elad is mostly talking about himself\n",
    "video_urls = ['https://www.youtube.com/watch?v=nglHX4B33_o']\n",
    "videos_text = \"\"\n",
    "\n",
    "for video_url in video_urls:\n",
    "    video_text = get_video_transcripts(video_url)\n",
    "    \n",
    "    videos_text += video_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba8befc0",
   "metadata": {},
   "source": [
    "Let's look at at sample from the video"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "48a3119e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I like to say that startups are an act of desperation and the desperation went out of the ecosystem over the last two or three years and we just had people showing up for the status and the money and now I think it's getting back to people who are doing it for a variety of reasons including the impa\n"
     ]
    }
   ],
   "source": [
    "print(video_text[:300])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "feb2238e",
   "metadata": {},
   "source": [
    "Awesome now that we have all of our data, let's combine it together into a single information block"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "879b32fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "user_information = user_tweets + website_data + video_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "502e8be1",
   "metadata": {},
   "source": [
    "Our `user_information` variable is a big messy wall of text. Ideally we would clean this up more and try to increase the signal to noise ratio. However for this project we'll just focus on the core use case of gathering data.\n",
    "\n",
    "Next we'll chunk our wall of text into pieces so we can do a map_reduce process on it. If you want learn more about techniques to split up your data check out my video on [OpenAI Token Workarounds](https://www.youtube.com/watch?v=f9_BWhCI4Zo)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "61a563bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we make our text splitter\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=20000, chunk_overlap=2000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "4e98c609",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Then we split our user information into different documents\n",
    "docs = text_splitter.create_documents([user_information])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e2c6799e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's see how many documents we created\n",
    "len(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c2d766c",
   "metadata": {},
   "source": [
    "Because we have a special requset for the LLM on our data, I want to make custom prompts. This will allow me to tinker with what data the LLM pulls out. I'll use Langchain's `load_summarize_chain` with custom prompts to do this. We aren't making a summary, but rather just using `load_summarize_chain` for its easy mapreduce functionality.\n",
    "\n",
    "First let's make our custom map prompt. This is where we'll instruction the LLM that it will pull out interview questoins and what makes a good question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "835c6cd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "map_prompt = \"\"\"You are a helpful AI bot that aids a user in research.\n",
    "Below is information about a person named {persons_name}.\n",
    "Information will include tweets, interview transcripts, and blog posts about {persons_name}\n",
    "Your goal is to generate interview questions that we can ask {persons_name}\n",
    "Use specifics from the research when possible\n",
    "\n",
    "% START OF INFORMATION ABOUT {persons_name}:\n",
    "{text}\n",
    "% END OF INFORMATION ABOUT {persons_name}:\n",
    "\n",
    "Please respond with list of a few interview questions based on the topics above\n",
    "\n",
    "YOUR RESPONSE:\"\"\"\n",
    "map_prompt_template = PromptTemplate(template=map_prompt, input_variables=[\"text\", \"persons_name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48ef7a88",
   "metadata": {},
   "source": [
    "Then we'll make our custom combine promopt. This is the set of instructions that we'll LLM on how to handle the list of questions that is returned in the first step above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "f09dee43",
   "metadata": {},
   "outputs": [],
   "source": [
    "combine_prompt = \"\"\"\n",
    "You are a helpful AI bot that aids a user in research.\n",
    "You will be given a list of potential interview questions that we can ask {persons_name}.\n",
    "\n",
    "Please consolidate the questions and return a list\n",
    "\n",
    "% INTERVIEW QUESTIONS\n",
    "{text}\n",
    "\"\"\"\n",
    "combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=[\"text\", \"persons_name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d58ea84",
   "metadata": {},
   "source": [
    "Let's create our LLM and chain. I'm increasing the color a bit for more creative language. If you notice that your questions have hallucinations in them, turn temperature to 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "583e4696",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = ChatOpenAI(temperature=.25, model_name='gpt-4')\n",
    "\n",
    "chain = load_summarize_chain(llm,\n",
    "                             chain_type=\"map_reduce\",\n",
    "                             map_prompt=map_prompt_template,\n",
    "                             combine_prompt=combine_prompt_template,\n",
    "#                              verbose=True\n",
    "                            )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5645594d",
   "metadata": {},
   "source": [
    "Ok, finally! With all of our data gathered and prompts ready, let's run our chain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "9cde3021",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning: model not found. Using cl100k_base encoding.\n"
     ]
    }
   ],
   "source": [
    "output = chain({\"input_documents\": docs, # The seven docs that were created before\n",
    "                \"persons_name\": \"Elad Gil\"\n",
    "               })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "b4f49991",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. As an investor and advisor to various AI companies, what are some common challenges you've observed in the industry, and how do you recommend overcoming them?\n",
      "\n",
      "2. Can you elaborate on the advantages of bootstrapping for AI startups and share any success stories you've come across?\n",
      "\n",
      "3. What are some key lessons you've learned from your experiences in high-profile companies like Twitter, Google, and Color Health that have shaped your approach to investing and advising startups?\n",
      "\n",
      "4. How do you think AI will continue to shape the job market in the coming years?\n",
      "\n",
      "5. What motivated you to enter the healthcare space as a co-founder of Color Health, and how do you envision the role of AI in improving healthcare outcomes?\n",
      "\n",
      "6. Can you share some insights on what sets high growth companies apart from others and the key factors that contribute to their rapid growth?\n",
      "\n",
      "7. How do you evaluate the defensibility of AI startups when considering investment or advisory opportunities?\n",
      "\n",
      "8. What excites you the most about the future of AI, and what challenges do you foresee in its development and implementation?\n",
      "\n",
      "9. Can you share your vision for Color Health and how it aims to revolutionize the healthcare industry?\n",
      "\n",
      "10. What were the key challenges you faced during the rapid growth of Twitter, and how did you overcome them?\n",
      "\n",
      "11. What advice would you give to founders looking to build defensibility into their startups from the beginning?\n",
      "\n",
      "12. Can you share an example of a company that has successfully maintained a user-centric focus and how it has contributed to their success?\n",
      "\n",
      "13. How do you see the balance between serving customer needs and building defensibility evolving in the future of AI-driven products and services?\n",
      "\n",
      "14. Can you elaborate on the factors that contribute to your prediction of 2023 being a rough year for mid to late-stage private technology companies and how startups can prepare for these challenges?\n",
      "\n",
      "15. What do you think are the most promising applications of large language models like GPT in the near future, and how can startups leverage them for growth?\n",
      "\n",
      "16. How do you see the open versus closed structure playing out in the AI industry, and what implications could it have for startups and established companies in the AI space?\n",
      "\n",
      "17. How do you think the costs involved in training large language models like GPT-3 and GPT-4 will affect competition and innovation in the AI industry, particularly for startups with limited resources?\n",
      "\n",
      "18. What do you think are the key factors driving growth in the space and defense technology sector, and what opportunities do you see for startups in this industry?\n",
      "\n",
      "19. How do you envision the future of defense tech startups, and what challenges do they need to overcome to succeed in this competitive landscape?\n",
      "\n",
      "20. What lessons can other startups in the defense sector learn from Anduril's success, and how can they apply these strategies to their own businesses?\n"
     ]
    }
   ],
   "source": [
    "print (output['output_text'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73d84509",
   "metadata": {},
   "source": [
    "Awesome! Now we have some questions we can iterate on before we chat with the person. You can swap out different sources for different people.\n",
    "\n",
    "These questions won't be 100% 'copy & paste' ready, but they should serve as a really solid starting point for you to build on top of.\n",
    "\n",
    "Next, let's port this code over to a [Streamlit](https://streamlit.io/) app so we can share a deployed version easily"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
